Abstract: Ensemble learning aims to improve generalization
ability by using multiple base learners. It is well-known that
to construct a good ensemble, the base learners should be
accurate as well as diverse. In this paper, unlabeled data is
exploited to facilitate ensemble learning by helping augment
the diversity among the base learners. Specifically, a
semi-supervised ensemble method named Udeed is proposed. Unlike
existing semi-supervised ensemble methods, where error-prone
pseudo-labels are estimated for unlabeled data to enlarge the
labeled data and thereby improve accuracy, Udeed works by
maximizing the accuracies of base learners on labeled data while
maximizing the diversity among them on unlabeled data. Experiments
show that Udeed can effectively utilize unlabeled data for ensemble
learning and is highly competitive with well-established
semi-supervised ensemble methods.
I. Introduction

In ensemble learning [8], a number of base learners are
trained and then combined for prediction to achieve strong
generalization ability. Numerous effective ensemble methods
have been proposed, such as Boosting [9], Bagging [4],
Stacking [19], etc., and most of these methods work under
the supervised setting where the labels of training examples 
are known. In many real-world tasks, however, unlabeled 
training examples are readily available while obtaining their 
labels would be fairly expensive. Semi-supervised learning 
Q is a major paradigm to exploit unlabeled data together 
with labeled training data to improve learning performance 
automatically, without human intervention. 

This paper deals with semi-supervised ensembles, that
is, ensemble learning with labeled and unlabeled data. In
contrast to the large body of literature on ensemble
learning and on semi-supervised learning, only a few works
have been devoted to the study of semi-supervised ensembles.
As indicated by Zhou [20], this was caused by the different
philosophies of the ensemble learning community and the
semi-supervised learning community. The ensemble learning
community believes that it is able to boost the performance
of weak learners to strong learners by using multiple
learners, and so there is no need to use unlabeled data; while the
semi-supervised learning community believes that it is able
to boost the performance of weak learners to strong learners
by exploiting unlabeled data, and so there is no need to use
multiple learners. However, as Zhou indicated [20], there are
several important reasons why ensemble learning and
semi-supervised learning are actually mutually beneficial, among
which an important one is that by considering unlabeled data
it is possible to help augment the diversity among the base
learners, as explained in the following paragraph.

It is well-known that the generalization error of an
ensemble is related to the average generalization error of the
base learners and the diversity among the base learners.
Generally, the lower the average generalization error (or, 
the higher the average accuracy) of the base learners and 
the higher the diversity among the base learners, the better 
the ensemble [11]. Previous ensemble methods work under
the supervised setting, trying to achieve a high average accuracy
and a high diversity by using the labeled training set. It
is noteworthy, however, that pursuing high accuracy and
high diversity at the same time may lead to a dilemma. For example,
two classifiers which have perfect performance on the
labeled training set would not have any diversity, since there
is no difference between their predictions on the training
examples. Thus, increasing the diversity requires sacrificing
the accuracy of one classifier. However, when we have
unlabeled data, we might find that these two classifiers 
actually make different predictions on unlabeled data. This 
would be important for ensemble design. For example, given
two pairs of classifiers, (A, B) and (C, D), if we know that
all of them have 100% accuracy on labeled training data,
then there will be no difference between taking the ensemble
consisting of (A, B) and the ensemble consisting of (C, D);
however, if we find that A and B make the same predictions
on unlabeled data, while C and D make different predictions
on some unlabeled data, then we will know that the ensemble
consisting of (C, D) should be better. So, in contrast to
previous ensemble methods which focus on achieving both
high accuracy and high diversity using only the labeled data,
the use of unlabeled data would open a promising direction
for designing new ensemble methods.

In this paper, we propose the Udeed (Unlabeled Data to
Enhance Ensemble Diversity) approach. Experiments show
that by using unlabeled data for diversity augmentation,
Udeed achieves much better performance than its
counterpart which does not consider the usefulness of unlabeled
data. Moreover, Udeed also achieves highly comparable
performance to other state-of-the-art semi-supervised
ensemble methods.

The rest of this paper is organized as follows. Section II
briefly reviews related work on semi-supervised ensembles.
Section III presents Udeed. Section IV reports our
experimental results. Finally, Section V concludes.

II. Related Work 

As mentioned before, in contrast to the large body
of literature on ensemble learning and on semi-supervised
learning, only a few works have been devoted to the study of
semi-supervised ensembles.

Zhou and Li [21] proposed the Tri-Training approach,
which uses three classifiers. In each round, if two
classifiers agree on an unlabeled instance while the third classifier
disagrees, then the two classifiers, under a certain condition,
will label this unlabeled instance for the third classifier;
the three classifiers are voted to make the prediction. This
is a disagreement-based semi-supervised learning approach
[22], which can be viewed as a variant of the famous
co-training method [3]. Later, Li and Zhou [14] extended
Tri-Training to Co-Forest by including more base
classifiers; in each round, the majority-teach-minority
strategy is still adopted.

In addition to Tri-Training and Co-Forest, there are
several semi-supervised boosting methods [1], [6], [7], [16],
[18]. D'Alché-Buc et al. [7] proposed SSMBoost to handle
unlabeled data within the margin cost functional optimization
framework for boosting [17], where the margin of an
ensemble $H$ on unlabeled data $x$ is defined as either $H(x)^2$ or
$|H(x)|$. Furthermore, SSMBoost requires the base learners
to be semi-supervised algorithms themselves. Later, Bennett
et al. [1] developed Assemble, which labels unlabeled
data $x$ by the current ensemble as $y = \mathrm{sign}[H(x)]$, and
then iteratively puts the newly labeled examples into the
original labeled set to train a new base classifier which is
then added to $H$. Following the same margin cost functional
optimization framework, Chen and Wang [6] added a local
smoothness regularizer to the objective function used by
Assemble to help induce new base classifiers with a more
reliable self-labeling process. Other than the margin cost
functional formalization, MCSSB [18] and SemiBoost [16]
estimate the labels of unlabeled instances by optimizing
an objective function containing two terms. The first term
encodes the manifold assumption that unlabeled instances
with high similarities in input space should share similar
labels, while the other term encodes the clustering assumption
that unlabeled instances with high similarities to a labeled
example should share its given label. The difference lies in
that MCSSB [18] implements the objective terms based on
Bregman divergence while SemiBoost [16] implements
them with the traditional exponential loss.

A common feature of these existing semi-supervised
ensemble methods is that they construct ensembles iteratively,
and in particular, the unlabeled data are exploited by
assigning pseudo-labels to them to enlarge the labeled training
set. Specifically, pseudo-labels of unlabeled instances are
estimated based on the ensemble trained so far [1], [7],
[14], [21], or with a specific form of smoothness or manifold
regularization [6], [16], [18]. After that, by regarding
the estimated labels as their ground-truth labels, unlabeled
instances are used in conjunction with labeled examples to
update the current ensemble iteratively.

Although various strategies have been employed to make
the pseudo-labeling process more reliable, such as by
incorporating data editing [13], the estimated pseudo-labels may
still be prone to error, especially in initial training iterations
where the ensemble is only moderately accurate. In the next
section we will present the Udeed approach. Rather than
working with pseudo-labels to enlarge the labeled training set,
Udeed utilizes unlabeled data in a different way, namely to help
augment the diversity among base learners.

III. The UDEED Approach 
A. General Formulation 

Let $\mathcal{X} = \mathbb{R}^d$ be the $d$-dimensional input space and
$\mathcal{Y} = \{-1, +1\}$ be the output space. Suppose
$\mathcal{L} = \{(x_i, y_i) \mid 1 \le i \le L\}$ contains $L$ labeled training examples and
$\mathcal{U} = \{x_i \mid L+1 \le i \le L+U\}$ contains $U$ unlabeled training
examples, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. In addition, we use
$\tilde{\mathcal{L}} = \{x_i \mid 1 \le i \le L\}$ to denote the unlabeled data set
derived from $\mathcal{L}$.

We assume that the classifier ensemble is composed of
$m$ base classifiers $\{f_k \mid 1 \le k \le m\}$, where each of them
takes the form $f_k : \mathcal{X} \to [-1, +1]$. Here, the real value
of $f_k(x)$ corresponds to the confidence of $x$ being positive.
Accordingly, $(f_k(x) + 1)/2$ can be regarded as the posterior
probability of being positive given $x$, i.e. $P(y = +1 \mid x)$.

The basic idea of Udeed is to maximize the fit of the
classifiers on the labeled data, while maximizing the diversity
of the classifiers on the unlabeled data. Therefore, Udeed
generates the classifier ensemble $f = (f_1, f_2, \ldots, f_m)$ by
minimizing the following loss function:

$$V(f, \mathcal{L}, \mathcal{D}) = V_{emp}(f, \mathcal{L}) + \gamma \cdot V_{div}(f, \mathcal{D}) \tag{1}$$

Here, the first term $V_{emp}(f, \mathcal{L})$ corresponds to the empirical
loss of $f$ on the labeled data set $\mathcal{L}$; the second term
$V_{div}(f, \mathcal{D})$ corresponds to the diversity loss of $f$ on a
specified data set $\mathcal{D}$ (e.g. $\mathcal{D} = \mathcal{U}$). Furthermore, $\gamma$ is the
cost parameter balancing the importance of the two terms.

In this paper, Udeed calculates the first term $V_{emp}(f, \mathcal{L})$
in Eq.(1) as:

$$V_{emp}(f, \mathcal{L}) = \frac{1}{m} \sum_{k=1}^{m} l(f_k, \mathcal{L}) \tag{2}$$

Here, $l(f_k, \mathcal{L})$ measures the empirical loss of the $k$-th base
classifier $f_k$ on the labeled data set $\mathcal{L}$.



As shown in Eq.(1), the second term $V_{div}(f, \mathcal{D})$ is used
to characterize the diversity among the base learners.
However, it is well-known that diversity measurement is not
a straightforward task since there is no generally accepted
formal definition [12]. In this paper, Udeed chooses to
calculate $V_{div}(f, \mathcal{D})$ in a novel way as follows:

$$V_{div}(f, \mathcal{D}) = \frac{2}{m(m-1)} \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} d(f_p, f_q, \mathcal{D}), \quad \text{where } d(f_p, f_q, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} f_p(x) f_q(x) \tag{3}$$

Here, $|\mathcal{D}|$ returns the cardinality of data set $\mathcal{D}$. Intuitively,
$d(f_p, f_q, \mathcal{D})$ represents the prediction difference between any
pair of base classifiers on a specified data set $\mathcal{D}$.¹ Since the
product $f_p(x) f_q(x)$ is small when the two outputs disagree in sign,
minimizing $V_{div}$ pushes the base classifiers toward disagreement. In addition,
the prediction difference is calculated based on the concrete
output $f(x)$ instead of the signed output $\mathrm{sign}[f(x)]$. In this
way, the prediction confidence of each classifier, rather than
the simple binary prediction, is fully utilized.

Then, Udeed aims to find the target model $f^*$ which
minimizes the loss function in Eq.(1):

$$f^* = \arg\min_{f} V(f, \mathcal{L}, \mathcal{D}) \tag{4}$$

B. Logistic Regression Implementation

In this paper, we employ logistic regression to implement
the base classifiers. Specifically, each base classifier $f_k$
($1 \le k \le m$) is modeled as:

$$f_k(x) = 2 \cdot g_k(x) - 1 = 2 \cdot \frac{1}{1 + e^{-(w_k^\top x + b_k)}} - 1 \tag{5}$$

Here, $g_k : \mathcal{X} \to [0, 1]$ is the standard logistic regression
function with weight vector $w_k \in \mathbb{R}^d$ and bias value $b_k \in \mathbb{R}$.
Without loss of generality, in the rest of this paper, $b_k$ is
absorbed into $w_k$ by appending the input space $\mathcal{X}$ with an
extra dimension fixed at value 1.

Correspondingly, the first term $V_{emp}(f, \mathcal{L})$ in Eq.(1) is
set to be the negative binomial likelihood function on the
labeled data set $\mathcal{L}$, which is commonly used to measure the
empirical loss of logistic regression:

$$V_{emp}(f, \mathcal{L}) = \frac{1}{m} \sum_{k=1}^{m} l(f_k, \mathcal{L}) = \frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{L} -\mathrm{BLH}(f_k(x_i), y_i)$$

Here, the term $\mathrm{BLH}(f_k(x_i), y_i)$ calculates the binomial
likelihood of $x_i$ having label $y_i$, when $f_k$ serves as the
classification model. Note that the probabilities $P(y = +1 \mid x)$
and $P(y = -1 \mid x)$ are modeled as $\frac{1 + f_k(x)}{2}$ and $\frac{1 - f_k(x)}{2}$
respectively; $\mathrm{BLH}(f_k(x_i), y_i)$ then takes the following form
based on Eq.(5):

$$\mathrm{BLH}(f_k(x_i), y_i) = \ln\left[ \left(\frac{1 + f_k(x_i)}{2}\right)^{\frac{1 + y_i}{2}} \left(\frac{1 - f_k(x_i)}{2}\right)^{\frac{1 - y_i}{2}} \right] = -\ln\left(1 + e^{-y_i w_k^\top x_i}\right)$$

¹As reviewed in [12], most existing diversity measures are calculated
based on the oracle (correct/incorrect) outputs of base learners, i.e. the
ground-truth labels of the data set are assumed to be known. However,
considering that examples contained in the specified data set $\mathcal{D}$ may be
unlabeled, it is then infeasible to calculate $d(f_p, f_q, \mathcal{D})$ by directly utilizing
existing diversity measures.
Note that the first term $V_{emp}(f, \mathcal{L})$ can also be evaluated in
other ways, such as $\ell_2$ loss: $\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{L} (f_k(x_i) - y_i)^2$,
hinge loss: $\frac{1}{m} \sum_{k=1}^{m} \sum_{i=1}^{L} \left(1 - y_i f_k(x_i)\right)$, etc.
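The pieces of the loss can be put together in a few lines. The following is a minimal sketch, not the authors' code. It assumes that the diversity term averages $d(f_p, f_q, \mathcal{D})$ over all pairs of base classifiers, that BLH is the log-likelihood under $P(y = +1 \mid x) = (1 + f_k(x))/2$, and that every input vector already carries the extra constant feature 1 for the bias. The function names (`f_out`, `neg_blh`, `v_emp`, `pair_d`, `v_div`, `udeed_loss`) and the toy data are illustrative only.

```python
import math

def f_out(w, x):
    """Output f_k(x) = 2*sigmoid(w.x) - 1, which equals tanh(w.x / 2)."""
    return math.tanh(sum(a * b for a, b in zip(w, x)) / 2.0)

def neg_blh(w, x, y):
    """Negative log-likelihood -BLH = ln(1 + exp(-y * w.x)) for y in {-1, +1}."""
    z = y * sum(a * b for a, b in zip(w, x))
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))  # stable softplus(-z)

def v_emp(ws, labeled):
    """Empirical loss: summed -BLH on the labeled data, averaged over classifiers."""
    return sum(neg_blh(w, x, y) for w in ws for x, y in labeled) / len(ws)

def pair_d(wp, wq, data):
    """d(f_p, f_q, D): mean product of the two real-valued outputs on D."""
    return sum(f_out(wp, x) * f_out(wq, x) for x in data) / len(data)

def v_div(ws, data):
    """Diversity loss, assumed here to be the average of d over all pairs."""
    m = len(ws)
    if m < 2 or not data:
        return 0.0  # empty D (no diversity term) or a single classifier
    total = sum(pair_d(ws[p], ws[q], data)
                for p in range(m) for q in range(p + 1, m))
    return 2.0 * total / (m * (m - 1))

def udeed_loss(ws, labeled, data, gamma=1.0):
    """V(f, L, D) = V_emp(f, L) + gamma * V_div(f, D)."""
    return v_emp(ws, labeled) + gamma * v_div(ws, data)

# Toy data: two features plus a constant 1 appended for the bias.
labeled = [([1.0, 0.5, 1.0], +1), ([-1.0, -0.5, 1.0], -1)]
unlabeled = [[0.2, -0.3, 1.0], [-0.4, 0.9, 1.0], [0.7, 0.1, 1.0]]
agreeing = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]     # two identical classifiers
disagreeing = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # classifiers using different features
```

On this toy data, `v_div(agreeing, unlabeled)` is positive (identical classifiers always have a non-negative output product), while `v_div(disagreeing, unlabeled)` is lower, which is the direction the diversity term rewards.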

The target model $f^*$ is found by employing gradient
descent-based techniques. Accordingly, the gradients of
$V(f, \mathcal{L}, \mathcal{D})$ with respect to the model parameters
$\{w_k \mid 1 \le k \le m\}$ are determined as follows:²



$$\sum f_{k'}(x) \cdot \left(1 - f_k(x)^2\right) \cdot x \tag{7}$$
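For readers who want to see where a term of the form $f_{k'}(x)\,(1 - f_k(x)^2)\,x$ comes from, the following derivation is a sketch under stated assumptions rather than a restatement of the paper's equations. It assumes the bias is absorbed into $w_k$, that $V_{div}$ averages $d(f_p, f_q, \mathcal{D})$ over all pairs of base classifiers, and that $-\mathrm{BLH}(f_k(x_i), y_i) = \ln(1 + e^{-y_i w_k^\top x_i})$.

```latex
% Assumptions: b_k absorbed into w_k; V_div is the pairwise average of d;
% -BLH(f_k(x_i), y_i) = ln(1 + exp(-y_i w_k^T x_i)).
\begin{align*}
\frac{\partial f_k(x)}{\partial w_k}
  &= 2\, g_k(x)\,\bigl(1 - g_k(x)\bigr)\, x
   = \tfrac{1}{2}\bigl(1 - f_k(x)^2\bigr)\, x, \\
\frac{\partial V_{emp}}{\partial w_k}
  &= -\frac{1}{2m} \sum_{i=1}^{L} \bigl(y_i - f_k(x_i)\bigr)\, x_i, \\
\frac{\partial V_{div}}{\partial w_k}
  &= \frac{1}{m(m-1)\,|\mathcal{D}|} \sum_{k' \neq k} \sum_{x \in \mathcal{D}}
     f_{k'}(x)\,\bigl(1 - f_k(x)^2\bigr)\, x, \\
\frac{\partial V}{\partial w_k}
  &= \frac{\partial V_{emp}}{\partial w_k}
   + \gamma\, \frac{\partial V_{div}}{\partial w_k}.
\end{align*}
```

The first line uses $g_k = (1 + f_k)/2$, so that $g_k (1 - g_k) = (1 - f_k^2)/4$; the diversity gradient then follows because $w_k$ appears in exactly the $m - 1$ pairs that contain $f_k$.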



To initialize the ensemble, each classifier $f_k$ is learned from
a bootstrapped sample of $\mathcal{L}$, namely $\mathcal{L}_k = \{(x_i^k, y_i^k) \mid 1 \le i \le L\}$,
by the conventional maximum likelihood procedure.
Specifically, the corresponding model parameter $w_k$ is
obtained by minimizing the objective function $\frac{1}{2}\|w_k\|^2 +
\lambda \cdot \sum_{i=1}^{L} -\mathrm{BLH}(f_k(x_i^k), y_i^k)$. Here, $\lambda$ balances the model
complexity and the binomial likelihood of $f_k$ on $\mathcal{L}_k$. In

²Note that under the logistic regression implementation, the loss function
$V(f, \mathcal{L}, \mathcal{D})$ is generally non-convex, and the target model $f^*$ returned by
the gradient descent process would correspond to a local optimal solution.



Table I

Characteristics of the data sets (d: dimensionality, pos./neg.: #positive examples / #negative examples).

data set   d    pos./neg.      data set   d    pos./neg.      data set   d    pos./neg.      data set   d    pos./neg.      data set   d    pos./neg.
diabetes   8    268/500        vote       16   168/267        ionosphere 34   255/126        credit_g   61   300/700        adult      123  7841/24720
heart      9    120/150        vehicle    16   218/217        kr_vs_kp   40   1527/1669      BCI        117  200/200        web        300  1479/48270
wdbc       14   357/212        hepatitis  19   123/32         isolet     51   300/300        Digit1     241  734/766        ijcnn1     22   13565/128126
austra     15   307/383        labor      26   37/20          sonar      60   111/97         COIL2      241  750/750        cod-rna    8    110384/220768
house      16   108/124        ethn       30   1310/1320      colic      60   136/232        g241n      241  748/752        forest     54   283301/297711


this paper, $\lambda$ is set to the default value of 1. Note that
the ensemble can also be initialized in other ways, such as
instantiating each $w_k$ with random values, etc.
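The initialization step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the regularized objective is minimized with plain fixed-step gradient descent (the text only says "conventional maximum likelihood procedure"), the bias is handled by a constant feature 1 appended to each input, and the helper names (`bootstrap`, `objective`, `fit`, `init_ensemble`), the step size and the toy data are all hypothetical.

```python
import math
import random

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def bootstrap(labeled, rng):
    """Draw |L| examples from L with replacement."""
    return [rng.choice(labeled) for _ in labeled]

def objective(w, sample, lam=1.0):
    """0.5 * ||w||^2 + lam * sum_i -BLH, with -BLH = ln(1 + exp(-y * w.x))."""
    nll = 0.0
    for x, y in sample:
        z = y * dot(w, x)
        nll += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return 0.5 * dot(w, w) + lam * nll

def fit(sample, lam=1.0, steps=100, lr=0.1):
    """Minimize the objective by plain gradient descent (an assumed optimizer)."""
    w = [0.0] * len(sample[0][0])
    for _ in range(steps):
        grad = list(w)  # gradient of the 0.5 * ||w||^2 term
        for x, y in sample:
            f = math.tanh(dot(w, x) / 2.0)  # f_k(x) = 2*sigmoid(w.x) - 1
            coeff = -lam * (y - f) / 2.0    # derivative of -BLH w.r.t. w.x
            for j, xj in enumerate(x):
                grad[j] += coeff * xj
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def init_ensemble(labeled, m, seed=0):
    """One classifier per bootstrapped sample of the labeled set."""
    rng = random.Random(seed)
    return [fit(bootstrap(labeled, rng)) for _ in range(m)]

# Toy labeled set: two features plus a constant 1 for the bias.
labeled = [([0.9, 0.2, 1.0], +1), ([0.7, -0.4, 1.0], +1), ([0.5, 0.8, 1.0], +1),
           ([-0.8, 0.1, 1.0], -1), ([-0.6, -0.5, 1.0], -1), ([-0.9, 0.6, 1.0], -1)]
ensemble = init_ensemble(labeled, m=5)
```

Because each classifier sees a different bootstrapped sample, the initial weight vectors already differ slightly, which gives the later diversity-driven updates something to work with.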

As shown in Eq.(1), the second term $V_{div}(f, \mathcal{D})$ regarding
ensemble diversity is defined on a specified data set $\mathcal{D}$. Given
the labeled training set $\mathcal{L}$ and the unlabeled training set $\mathcal{U}$,
we consider three possibilities of instantiating $\mathcal{D}$:

• $\mathcal{D} = \emptyset$: No data is employed to measure the diversity
among base learners ($V_{div}(f, \mathcal{D}) = 0$). The resulting
implementation is called Lc;

• $\mathcal{D} = \tilde{\mathcal{L}}$: Labeled training examples are employed
to measure the diversity among base learners, and
the ensemble is optimized by exploiting only $\mathcal{L}$. The
resulting implementation is called LCD;

• $\mathcal{D} = \mathcal{U}$: Unlabeled training examples are employed
to measure the diversity among base learners, and the
ensemble is optimized by exploiting both $\mathcal{L}$ and $\mathcal{U}$. The
resulting implementation is called LcUd.

For Lc and LCD, after the ensemble is initialized, a
series of gradient descent steps are performed to optimize
the model by minimizing the loss function $V(f, \mathcal{L}, \mathcal{D})$ as
defined in Eq.(1). For LcUd, however, instead of directly
minimizing $V(f, \mathcal{L}, \mathcal{D})$ in the straightforward way of setting
$\mathcal{D} = \mathcal{U}$, the loss function is first minimized by a series of
gradient descent steps with $\mathcal{D} = \tilde{\mathcal{L}}$. After that, by using
the learned model as the starting point, a series of gradient
descent steps are further conducted to finely search the
model space with $\mathcal{D} = \mathcal{U}$. The purpose of this two-stage
process is to distinguish the priorities of the contribution
from labeled data and unlabeled data.³
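The two-stage procedure for LcUd can be sketched as below. This is a hedged illustration, not the authors' code. It assumes the pairwise-average form of the diversity term, the logistic-regression loss with the bias absorbed into the weights, and a simplified stopping rule (stop as soon as the total loss fails to decrease). The step limit 25 and learning rate 0.25 are the values reported in the experiments; the function names and toy data are hypothetical, and the code requires at least two classifiers and a non-empty diversity set.

```python
import math

def f_out(w, x):
    """f_k(x) = 2*sigmoid(w.x) - 1 = tanh(w.x / 2)."""
    return math.tanh(sum(a * b for a, b in zip(w, x)) / 2.0)

def loss(ws, labeled, data, gamma=1.0):
    """V = V_emp + gamma * V_div, with V_div assumed to average d over pairs."""
    m = len(ws)
    emp = 0.0
    for w in ws:
        for x, y in labeled:
            z = y * sum(a * b for a, b in zip(w, x))
            emp += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    div = sum(f_out(ws[p], x) * f_out(ws[q], x)
              for p in range(m) for q in range(p + 1, m) for x in data)
    return emp / m + gamma * 2.0 * div / (m * (m - 1) * len(data))

def grad(ws, labeled, data, gamma=1.0):
    """Gradient of the loss above with respect to every weight vector w_k."""
    m = len(ws)
    out = []
    for k, wk in enumerate(ws):
        g = [0.0] * len(wk)
        for x, y in labeled:  # empirical-loss part
            c = -(y - f_out(wk, x)) / (2.0 * m)
            g = [gj + c * xj for gj, xj in zip(g, x)]
        for x in data:        # diversity part
            fk = f_out(wk, x)
            others = sum(f_out(ws[q], x) for q in range(m) if q != k)
            c = gamma * others * (1.0 - fk * fk) / (m * (m - 1) * len(data))
            g = [gj + c * xj for gj, xj in zip(g, x)]
        out.append(g)
    return out

def descend(ws, labeled, data, max_steps=25, lr=0.25):
    """Gradient descent that keeps a step only if it lowers the loss."""
    best = loss(ws, labeled, data)
    for _ in range(max_steps):
        g = grad(ws, labeled, data)
        cand = [[wj - lr * gj for wj, gj in zip(wk, gk)] for wk, gk in zip(ws, g)]
        val = loss(cand, labeled, data)
        if val >= best:
            break
        ws, best = cand, val
    return ws

def two_stage(ws, labeled, unlabeled):
    """Stage 1: diversity on the labeled inputs; stage 2: diversity on U."""
    labeled_inputs = [x for x, _ in labeled]
    stage1 = descend(ws, labeled, labeled_inputs)
    return descend(stage1, labeled, unlabeled)

labeled = [([1.0, 0.5, 1.0], 1), ([-1.0, -0.5, 1.0], -1),
           ([0.8, -0.2, 1.0], 1), ([-0.7, 0.3, 1.0], -1)]
unlabeled = [[0.2, -0.3, 1.0], [-0.4, 0.9, 1.0], [0.7, 0.1, 1.0], [-0.1, -0.8, 1.0]]
init = [[0.5, 0.0, 0.0], [0.3, 0.2, 0.0], [0.4, -0.1, 0.1]]
final = two_stage(init, labeled, unlabeled)
```

Starting the second stage from the first-stage solution means the unlabeled data only refine a model that already fits the labeled data, which is one way to read the stated purpose of prioritizing the labeled contribution.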

For any gradient descent-based optimization process, it is
terminated if either the loss function $V(f, \mathcal{L}, \mathcal{D})$ or the
diversity term $V_{div}(f, \mathcal{D})$ does not decrease anymore. For each
implementation of Udeed, the label of an unseen example $z$
is predicted by the learned ensemble $f^* = (f_1^*, f_2^*, \ldots, f_m^*)$
via weighted voting:⁴ $f^*(z) = \mathrm{sign}\left[\sum_{k=1}^{m} f_k^*(z)\right]$.

³Similar strategies have been adopted by some successful
semi-supervised ensemble methods [16], [18], where objective terms involving
labeled data are given much higher weight than those involving unlabeled
data.
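The difference between weighted voting and unweighted (majority) voting is easy to see on a small example. The sketch below is illustrative only; the function names are hypothetical, and the tie-breaking convention (a zero sum is mapped to +1) is an assumption that the text does not specify.

```python
def sign(v):
    """Sign with ties mapped to +1 (an assumed convention)."""
    return 1 if v >= 0 else -1

def weighted_vote(outputs):
    """sign[sum_k f_k(z)]: real-valued outputs are summed before thresholding."""
    return sign(sum(outputs))

def unweighted_vote(outputs):
    """sign[sum_k sign[f_k(z)]]: each classifier contributes one hard vote."""
    return sign(sum(sign(o) for o in outputs))

# One confident positive classifier against two barely negative ones.
outputs = [0.9, -0.1, -0.2]
weighted = weighted_vote(outputs)      # confidence decides: +1
unweighted = unweighted_vote(outputs)  # head count decides: -1
```

Here the single confident classifier outweighs two nearly undecided ones under weighted voting, whereas majority voting discards that confidence information.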

Intuitively, if the ensemble does benefit from the
diversity augmented by the unlabeled training examples, LcUd
should achieve better performance than Lc and LCD.

IV. Experiments 

In this section, comparative studies between Udeed (i.e.
LcUd) and other semi-supervised ensemble methods are
reported first. More importantly, an experimental analysis of
the three different implementations of Udeed is further
conducted to show whether unlabeled data do benefit
ensemble learning by helping augment the diversity among
base learners.

Twenty-five publicly available binary data sets are used
for experiments, whose characteristics are summarized in
Table I. Fifteen of them are from the UCI Machine Learning
Repository [2], five from the UCI KDD Archive [10], four from
and one from ifTSll . Twenty regular-scale data sets (left 
four columns) as well as five large-scale data sets (right
column) are included. The data set size varies from 57 to
581,012, the dimensionality varies from 8 to 300, and the
ratio of positive examples to negative examples varies
from 0.031 to 3.844.

For each data set, 50% of the examples are randomly selected
to form the test set $\mathcal{T}$, and the rest are used to form the
training set $\mathcal{L} \cup \mathcal{U}$. The proportion of labeled data in the
training set (i.e. $|\mathcal{L}| / (|\mathcal{L}| + |\mathcal{U}|)$) is set to 0.25. For each
data set, 50 random $\mathcal{L}/\mathcal{U}/\mathcal{T}$ splits are performed. Hereafter,
the reported performance of each method corresponds to the
average result over 50 runs on different splits.
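The splitting protocol can be sketched as follows. This is an illustration, not the authors' script: the rounding rule (truncation toward zero), the use of a seeded `random.Random`, and the function name `split_indices` are assumptions, and no stratification by class is applied because the text does not mention any.

```python
import random

def split_indices(n, rng, test_frac=0.5, labeled_frac=0.25):
    """Return (labeled, unlabeled, test) index lists for n examples."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_frac)                 # half of the data is held out
    test, train = idx[:n_test], idx[n_test:]
    n_labeled = int(len(train) * labeled_frac)  # a quarter of the rest keeps its labels
    return train[:n_labeled], train[n_labeled:], test

# Fifty random L/U/T splits of a hypothetical data set with 200 examples.
rng = random.Random(0)
splits = [split_indices(200, rng) for _ in range(50)]
```

With 200 examples each split contains 25 labeled, 75 unlabeled and 100 test examples, so only one eighth of the data carries labels during training.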

Various ensemble sizes (i.e. $m$) are considered in the
experiments: a) $m = 20$ representing the case of small-scale
ensemble; b) $m = 50$ representing the case of medium-scale
ensemble; and c) $m = 100$ representing the case of large-scale
ensemble.⁵ In addition, as shown in Eq.(1), the cost

⁴Compared to unweighted voting, where the label of $z$ is predicted by
$f^*(z) = \mathrm{sign}\left[\sum_{k=1}^{m} \mathrm{sign}[f_k^*(z)]\right]$, the prediction confidence of each
base learner could be fully utilized by weighted voting.

⁵Preliminary experiments show that, as the ensemble size increases from
10 to 100 within an interval of 100, the performance of Udeed does not
significantly change within successive ensemble sizes and tends to converge
as the ensemble size approaches 100.



Table II

Predictive accuracy (mean±std.) under small-scale ensemble size (m = 20). •/◦ indicates whether Udeed is statistically superior/inferior to the compared algorithm (pairwise t-test at 95% significance level).

Data Set      Udeed         Bagging       AdaBoost      Assemble      SemiBoost
diabetes      0.726±0.021   0.690±0.018•  0.728±0.029   0.700±0.031•  0.695±0.019•
heart         0.793±0.040   0.779±0.043•  0.766±0.045•  0.744±0.072•  0.789±0.035
wdbc          0.927±0.014   0.807±0.024•  0.934±0.025   0.898±0.070•  0.793±0.028•
austra        0.834±0.023   0.810±0.024•  0.809±0.028•  0.801±0.038•  0.815±0.029•
house         0.921±0.028   0.922±0.027   0.849±0.156•  0.921±0.036   0.924±0.029
vote          0.932±0.017   0.930±0.018•  0.906±0.106   0.928±0.019   0.932±0.017
vehicle       0.916±0.019   0.914±0.021   0.916±0.064   0.921±0.029   0.886±0.026•
hepatitis     0.800±0.042   0.792±0.026   0.763±0.077•  0.788±0.041   0.796±0.026
labor         0.809±0.072   0.801±0.074   0.646±0.142•  0.747±0.075•  0.810±0.071
ethn          0.944±0.007   0.942±0.008•  0.934±0.013•  0.939±0.010•  0.929±0.009•
ionosphere    0.795±0.043   0.721±0.023•  0.807±0.037   0.772±0.038•  0.746±0.027•
kr_vs_kp      0.940±0.008   0.938±0.008•  0.941±0.009   0.942±0.010   0.936±0.008•
isolet        0.989±0.007   0.988±0.006   0.714±0.244•  0.985±0.010•  0.989±0.005
sonar         0.690±0.069   0.690±0.070   0.701±0.063   0.672±0.068   0.692±0.067
colic         0.777±0.035   0.785±0.035◦  0.747±0.039•  0.748±0.037•  0.765±0.041•
credit_g      0.690±0.024   0.710±0.019◦  0.678±0.023•  0.686±0.025   0.702±0.019◦
BCI           0.582±0.039   0.576±0.039•  0.606±0.040◦  0.575±0.037   0.569±0.049•
Digit1        0.939±0.010   0.940±0.009   0.928±0.012•  0.927±0.012•  0.941±0.009◦
COIL2         0.807±0.029   0.809±0.028   0.862±0.017◦  0.819±0.023◦  0.823±0.021◦
g241n         0.793±0.020   0.794±0.018   0.760±0.021•  0.751±0.020•  0.791±0.022
adult         0.835±0.003   0.844±0.002◦  0.840±0.003◦  0.843±0.002◦  N/A
web           0.981±0.001   0.980±0.001•  0.980±0.001•  0.981±0.001◦  N/A
ijcnn1        0.914±0.001   0.906±0.001•  0.910±0.004•  0.906±0.001•  N/A
cod-rna       0.920±0.001   0.850±0.001•  0.945±0.003◦  0.851±0.002•  N/A
forest        0.706±0.002   0.703±0.002•  0.736±0.006◦  0.696±0.002•  N/A
win/tie/loss  /             13/9/3        13/7/5        14/8/3        9/8/3


parameter $\gamma$ is set to the default value of 1. Note that better
performance can be expected if certain strategies such as
cross-validation are employed to optimize the value of $\gamma$.

A. Comparative Studies 

In this subsection, Udeed (LcUd) is compared with two
popular ensemble methods, Bagging [4] and AdaBoost
[9], and two successful semi-supervised ensemble methods,
Assemble [1] and SemiBoost [16]. For fair comparison,
logistic regression is employed as the base learner of each
compared method. For Udeed, the maximum number of
gradient descent steps is set to 25 and the learning rate is set
to 0.25. For the other compared methods, default parameters
suggested in the respective literature are adopted.

Tables II to IV report the detailed experimental results
under small-scale (m=20), medium-scale (m=50) and
large-scale (m=100) ensemble sizes respectively. SemiBoost
fails to work on the large-scale data sets, due to its
demanding storage complexity ($O((|\mathcal{L}| + |\mathcal{U}|)^2)$) to maintain
the similarity matrix for the training examples.

On each data set, the mean predictive accuracy as well
as the standard deviation of each algorithm (out of 50
runs) are recorded. Furthermore, to statistically measure the
significance of performance difference, pairwise t-tests at
95% significance level are conducted between the
algorithms. Specifically, whenever Udeed achieves significantly
better/worse performance than the compared algorithm on
any data set, a win/loss is counted and a marker •/◦ is
shown. Otherwise, a tie is counted and no marker is given.
The resulting win/tie/loss counts for Udeed against the
compared algorithms are highlighted in the last line of each
table.
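The win/tie/loss bookkeeping can be sketched as follows. This is an illustration under assumptions rather than the authors' procedure: the test is taken to be a paired, two-sided t-test over the 50 per-split accuracies, "95% significance level" is read as a 5% two-sided threshold, and the critical value 2.0096 is the standard Student-t quantile for 49 degrees of freedom. The function names and the synthetic accuracy lists are hypothetical.

```python
import math

T_CRIT_DF49 = 2.0096  # two-sided 5% critical value of Student's t, 49 degrees of freedom

def paired_t(a, b):
    """t statistic of the paired differences a_i - b_i."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    if var == 0.0:  # all differences identical
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)

def outcome(udeed_acc, other_acc, t_crit=T_CRIT_DF49):
    """'win' / 'loss' if Udeed is significantly better / worse, else 'tie'."""
    t = paired_t(udeed_acc, other_acc)
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"

def tally(outcomes):
    """Collapse per-data-set outcomes into a (win, tie, loss) triple."""
    return tuple(outcomes.count(k) for k in ("win", "tie", "loss"))

# Synthetic accuracies over 50 runs (not taken from the paper).
a = [0.80 + 0.01 * (i % 5) for i in range(50)]
b = [v - 0.02 + 0.005 * (i % 3 - 1) for i, v in enumerate(a)]
results = [outcome(a, b), outcome(b, a), outcome(a, a)]
```

Because the test works on per-split differences, a small but consistent gap can be significant even when the reported standard deviations of the two methods overlap.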

In summary, when the ensemble size is small (Table II),
Udeed is statistically superior to Bagging, AdaBoost,
Assemble and SemiBoost in 52%, 52%, 56% and 45% of
cases, and is inferior to them in far fewer cases (12%, 20%,
12% and 15%); when the ensemble size is medium
(Table III), Udeed is statistically superior to Bagging,


Table III

Predictive accuracy (mean±std.) under medium-scale ensemble size (m = 50). •/◦ indicates whether Udeed is statistically superior/inferior to the compared algorithm (pairwise t-test at 95% significance level).

Data Set      Udeed         Bagging       AdaBoost      Assemble      SemiBoost
diabetes      0.710±0.020   0.691±0.019•  0.731±0.026◦  0.699±0.032•  0.696±0.019•
heart         0.794±0.033   0.782±0.032•  0.766±0.037•  0.736±0.078•  0.794±0.033
wdbc          0.885±0.017   0.806±0.022•  0.925±0.065◦  0.916±0.046◦  0.816±0.033•
austra        0.828±0.024   0.812±0.028•  0.808±0.025•  0.815±0.036•  0.816±0.029•
house         0.921±0.030   0.920±0.030   0.793±0.195•  0.925±0.034   0.924±0.029◦
vote          0.931±0.017   0.929±0.018•  0.868±0.151•  0.927±0.019   0.932±0.017
vehicle       0.914±0.022   0.914±0.021   0.914±0.088   0.919±0.025   0.893±0.026•
hepatitis     0.796±0.031   0.792±0.022   0.737±0.106•  0.785±0.045   0.797±0.027
labor         0.813±0.083   0.799±0.079•  0.681±0.142•  0.749±0.095•  0.804±0.083
ethn          0.944±0.006   0.942±0.007•  0.937±0.013•  0.939±0.011•  0.931±0.009•
ionosphere    0.797±0.042   0.722±0.022•  0.814±0.035◦  0.783±0.027•  0.748±0.028•
kr_vs_kp      0.939±0.008   0.938±0.008•  0.943±0.011◦  0.943±0.009◦  0.935±0.008•
isolet        0.989±0.006   0.988±0.007•  0.672±0.232•  0.986±0.008•  0.990±0.005
sonar         0.687±0.069   0.690±0.072   0.714±0.059◦  0.679±0.070   0.696±0.068
colic         0.783±0.033   0.783±0.036   0.744±0.043•  0.748±0.046•  0.763±0.040•
credit_g      0.703±0.024   0.711±0.020◦  0.674±0.026•  0.689±0.025•  0.703±0.019
BCI           0.582±0.041   0.577±0.041   0.620±0.043◦  0.583±0.051   0.572±0.045•
Digit1        0.941±0.010   0.940±0.010   0.929±0.012•  0.925±0.012•  0.941±0.009
COIL2         0.808±0.027   0.812±0.024   0.867±0.016◦  0.821±0.022◦  0.820±0.022◦
g241n         0.796±0.019   0.794±0.018   0.762±0.023•  0.750±0.020•  0.791±0.022•
adult         0.842±0.002   0.844±0.002◦  0.841±0.002•  0.842±0.002◦  N/A
web           0.981±0.001   0.980±0.001•  0.980±0.001   0.981±0.001◦  N/A
ijcnn1        0.907±0.001   0.906±0.001•  0.906±0.001•  0.910±0.004◦  N/A
cod-rna       0.891±0.001   0.851±0.001•  0.945±0.003◦  0.851±0.003•  N/A
forest        0.705±0.002   0.703±0.002•  0.737±0.006◦  0.698±0.003•  N/A
win/tie/loss  /             14/9/2        14/2/9        13/6/6        10/8/2


AdaBoost, Assemble and SemiBoost in 56%, 56%,
52% and 50% of cases, and is inferior to them in fewer cases (8%,
36%, 24% and 10%); when the ensemble size is large
(Table IV), Udeed is statistically superior to Bagging,
AdaBoost, Assemble and SemiBoost in 48%, 52%,
52% and 40% of cases, and is inferior to them in fewer cases
(8%, 40%, 20% and 15%). These results indicate
that Udeed is highly competitive with the other compared
methods. In terms of time complexity, roughly speaking,
Udeed is slightly more expensive than Bagging and AdaBoost
while fairly comparable to Assemble and SemiBoost.
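The percentages quoted in this summary follow directly from the win/tie/loss rows of the tables. The short check below is only a convenience sketch; the dictionary layout and the function name `rates` are illustrative. It uses the small-scale counts from Table II and shows why SemiBoost is measured over 20 data sets rather than 25: it has no result (N/A) on the five large-scale data sets.

```python
# win/tie/loss counts of Udeed against each method for m = 20 (Table II)
counts_small = {
    "Bagging": (13, 9, 3),
    "AdaBoost": (13, 7, 5),
    "Assemble": (14, 8, 3),
    "SemiBoost": (9, 8, 3),   # only 20 data sets: N/A on the five large-scale ones
}

def rates(win, tie, loss):
    """Return (win %, loss %) rounded to whole percentages."""
    total = win + tie + loss
    return round(100 * win / total), round(100 * loss / total)

summary = {name: rates(*c) for name, c in counts_small.items()}
```

The resulting pairs are (52, 12), (52, 20), (56, 12) and (45, 15), matching the small-scale figures in the text.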

B. The Helpfulness of Unlabeled Data 

As motivated in Section I, Udeed aims to exploit
unlabeled data to help ensemble learning in the particular way
of augmenting diversity among base learners. Therefore, in
addition to the above comparative experiments with other
(semi-supervised) ensemble methods, it is more important
to show whether Udeed (LcUd) does achieve better
performance than its counterparts (Lc and LCD) which do not
consider using unlabeled data for diversity augmentation.

Table V reports the performance improvement (i.e.
increase of predictive accuracy) of LcUd against Lc and
LCD under various ensemble sizes. On each data set, the
mean improvement in predictive accuracy as well as the standard
deviation (out of 50 runs) are recorded. In addition, to
statistically measure the significance of performance difference,
pairwise t-tests at 95% significance level are conducted.
Specifically, whenever LcUd achieves significantly
superior/inferior performance compared to Lc or LCD on any data set, a
win/loss is counted and a marker •/◦ is shown in the table.
Otherwise, a tie is counted and no marker is given. The
resulting win/tie/loss counts for LcUd against Lc and LCD
are highlighted in the last line of Table V.

In summary, when the ensemble size is small, LcUd is
statistically superior to Lc and LCD in 64% and 56% of cases,
and is inferior to each of them in only 12% of cases; when the
ensemble size is medium, LcUd is statistically superior to


Table IV

Predictive accuracy (mean±std.) under large-scale ensemble size (m = 100). •/◦ indicates whether Udeed is statistically superior/inferior to the compared algorithm (pairwise t-test at 95% significance level).

Data Set      Udeed         Bagging       AdaBoost      Assemble      SemiBoost
diabetes      0.700±0.020   0.692±0.018•  0.726±0.032◦  0.694±0.031   0.696±0.018•
heart         0.790±0.035   0.781±0.035•  0.757±0.041•  0.751±0.066•  0.792±0.036
wdbc          0.852±0.021   0.805±0.019•  0.930±0.064◦  0.916±0.037◦  0.825±0.030•
austra        0.824±0.025   0.812±0.024•  0.806±0.027•  0.808±0.038•  0.817±0.028•
house         0.921±0.028   0.921±0.029   0.831±0.180•  0.919±0.029   0.924±0.029◦
vote          0.930±0.017   0.930±0.018   0.902±0.104   0.926±0.020   0.932±0.017◦
vehicle       0.913±0.022   0.915±0.022   0.930±0.026◦  0.911±0.031   0.897±0.027•
hepatitis     0.797±0.027   0.790±0.023•  0.743±0.101•  0.782±0.040•  0.797±0.026
labor         0.811±0.080   0.808±0.080   0.683±0.146•  0.756±0.098•  0.809±0.075
ethn          0.943±0.007   0.942±0.007   0.938±0.012•  0.939±0.011•  0.932±0.008•
ionosphere    0.780±0.032   0.721±0.023•  0.812±0.037◦  0.779±0.042   0.747±0.027•
kr_vs_kp      0.939±0.008   0.938±0.007•  0.945±0.011◦  0.944±0.008◦  0.935±0.008•
isolet        0.989±0.006   0.989±0.006•  0.616±0.208•  0.984±0.012•  0.990±0.005
sonar         0.690±0.071   0.689±0.070   0.713±0.061◦  0.679±0.063   0.696±0.069
colic         0.784±0.033   0.786±0.033   0.741±0.041•  0.745±0.051•  0.763±0.042•
credit_g      0.706±0.021   0.711±0.021◦  0.679±0.024•  0.686±0.026•  0.703±0.019
BCI           0.580±0.041   0.578±0.042   0.620±0.043◦  0.588±0.041   0.572±0.046
Digit1        0.940±0.009   0.940±0.010   0.927±0.013•  0.925±0.011•  0.941±0.009
COIL2         0.807±0.027   0.811±0.024   0.870±0.016◦  0.819±0.027◦  0.820±0.021◦
g241n         0.795±0.018   0.796±0.018   0.760±0.023•  0.754±0.027•  0.792±0.022
adult         0.844±0.002   0.844±0.002◦  0.840±0.002•  0.843±0.002•  N/A
web           0.981±0.001   0.980±0.001•  0.980±0.002   0.981±0.001◦  N/A
ijcnn1        0.906±0.001   0.905±0.004•  0.906±0.001•  0.906±0.001◦  N/A
cod-rna       0.873±0.001   0.851±0.001•  0.945±0.003◦  0.851±0.003•  N/A
forest        0.705±0.002   0.703±0.002•  0.737±0.006◦  0.698±0.003•  N/A
win/tie/loss  /             12/11/2       13/2/10       13/7/5        8/9/3


Lc and LCD in 52% of cases each, and is inferior to each of them
in only 8% of cases; when the ensemble size is large,
LcUd is statistically superior to Lc and LCD in 52% and
56% of cases, and is inferior to them in only 8% and 12% of cases.
These results indicate that, by exploiting unlabeled data in
the specific way of helping augment ensemble diversity,
Udeed (LcUd) is capable of achieving better performance
than its counterparts (Lc and LCD) which do not consider
employing unlabeled data in ensemble generation.⁶

C. Diversity Analysis 

To verify that Udeed (LcUd) does increase the diversity
among base learners when the ensemble is generated using
unlabeled data, additional experiments are analyzed in this
subsection based on several existing diversity measures.
Specifically, the four diversity measures summarized in [12]
are considered, whose values are calculated from the oracle
(correct/incorrect) outputs of the base learners.

*Note that although in a number of cases the accuracy difference between
two algorithms looks rather marginal (e.g. less than 1%), the difference may
still be statistically significant according to the pairwise t-test.

Suppose $m$ denotes the number of base classifiers in the
ensemble and $N$ denotes the number of examples in the
test set $T$. In addition, let $O = [o_{ij}]_{m \times N}$ be the oracle
output matrix. Here, $o_{ij} = 1$ if the $i$-th base learner correctly
classifies the $j$-th test example ($1 \le i \le m$, $1 \le j \le N$);
otherwise, $o_{ij} = 0$. The formal definitions of the four
diversity measures are as follows:

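• Disagreement measure (DIS):

$$\mathrm{DIS} = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{k=i+1}^{m} \mathrm{dis}_{ik}, \quad \text{where} \quad \mathrm{dis}_{ik} = \frac{\sum_{j=1}^{N} \left[ o_{ij} \cdot (1-o_{kj}) + (1-o_{ij}) \cdot o_{kj} \right]}{N}$$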



Table V

Accuracy improvement (mean±std.) of LcUd against Lc and LCD under various ensemble sizes. •/◦ indicates whether LcUd is statistically superior/inferior to the compared implementation (pairwise t-test at 95% significance level).

              Accuracy improvement of LcUd against
              Lc                                              LCD
Data Set      m = 20          m = 50          m = 100         m = 20          m = 50          m = 100
diabetes      0.034±0.024•    0.019±0.013•    0.008±0.011•    0.011±0.012•    0.009±0.009•    0.004±0.007•
heart         0.023±0.027•    0.009±0.016•    0.006±0.013•    0.009±0.016•    0.003±0.010•    0.004±0.009•
wdbc          0.127±0.024•    0.075±0.012•    0.047±0.013•    0.033±0.014•    0.031±0.013•    0.023±0.008•
austra        0.022±0.022•    0.015±0.013•    0.010±0.008•    0.004±0.012•    0.006±0.008•    0.005±0.005•
house         0.003±0.010•    -0.001±0.005    0.001±0.004•    0.002±0.007•    0.000±0.004     0.001±0.003•
vote          0.002±0.005•    0.001±0.003•    0.001±0.003•    0.001±0.004     0.001±0.002•    0.001±0.001•
vehicle       0.005±0.010•    0.002±0.005     0.001±0.004     0.003±0.007•    0.001±0.005     0.001±0.004
hepatitis     0.010±0.035     0.005±0.027     0.008±0.017•    0.003±0.027     0.001±0.019     0.005±0.012•
labor         0.003±0.071     0.004±0.043     0.004±0.018     -0.007±0.041    0.007±0.032     0.004±0.012•
ethn          0.002±0.003•    0.001±0.002•    0.001±0.002•    0.001±0.002•    0.001±0.001•    0.001±0.001•
ionosphere    0.073±0.049•    0.076±0.049•    0.057±0.035•    0.015±0.034•    0.022±0.032•    0.029±0.024•
kr_vs_kp      0.002±0.003•    0.001±0.002•    0.001±0.001•    0.001±0.001•    0.001±0.001•    0.001±0.001•
isolet        0.001±0.003•    0.001±0.002•    0.001±0.002     0.001±0.002     0.001±0.001•    0.001±0.001
sonar         0.001±0.036     0.003±0.022     0.001±0.015     0.002±0.016     -0.001±0.014    0.001±0.011
colic         -0.006±0.014◦   -0.003±0.012    -0.001±0.008    -0.003±0.010◦   -0.003±0.009    0.001±0.006
credit_g      -0.019±0.017◦   -0.008±0.010◦   -0.005±0.008◦   -0.009±0.010◦   -0.004±0.006◦   -0.002±0.006◦
BCI           0.006±0.015•    0.003±0.010     0.002±0.012     0.005±0.010•    0.002±0.010     0.002±0.011
Digit1        0.001±0.005     0.001±0.002     0.001±0.004     0.001±0.005     0.001±0.002     0.001±0.003
COIL2         -0.001±0.016    -0.004±0.016    -0.003±0.015    0.001±0.005     -0.001±0.006    -0.002±0.007◦
g241n         0.001±0.005     0.001±0.004     -0.001±0.004    -0.001±0.004    0.001±0.004     -0.001±0.004
adult         -0.009±0.002◦   -0.002±0.002◦   -0.001±0.001◦   -0.006±0.001◦   -0.002±0.001◦   -0.001±0.001◦
web           0.001±0.001•    0.001±0.001•    0.000±0.000     0.001±0.001•    0.001±0.001•    0.000±0.000
ijcnn1        0.008±0.001•    0.001±0.001•    0.001±0.001•    0.006±0.001•    0.001±0.001•    0.001±0.001•
cod-rna       0.069±0.001•    0.041±0.001•    0.023±0.001•    0.022±0.001•    0.018±0.001•    0.011±0.001•
forest        0.003±0.001•    0.002±0.001•    0.001±0.001•    0.001±0.001•    0.001±0.001•    0.001±0.001•
win/tie/loss  16/6/3          13/10/2         13/10/2         14/8/3          13/10/2         14/8/3



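• Double-fault measure (DF):

$$\mathrm{DF} = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{k=i+1}^{m} \mathrm{df}_{ik}, \quad \text{where} \quad \mathrm{df}_{ik} = \frac{\sum_{j=1}^{N} (1-o_{ij}) \cdot (1-o_{kj})}{N}$$

• Entropy measure (ENT):

$$\mathrm{ENT} = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{m - \lceil m/2 \rceil} \min\left\{ \sum_{i=1}^{m} o_{ij},\; m - \sum_{i=1}^{m} o_{ij} \right\}$$

• Coincident failure diversity (CFD):

$$\mathrm{CFD} = \begin{cases} 0, & p_0 = 1.0 \\ \dfrac{1}{1-p_0} \sum_{n=1}^{m} \dfrac{m-n}{m-1}\, p_n, & p_0 < 1.0 \end{cases}$$

where

$$p_n = \frac{\left| \left\{ j : \sum_{i=1}^{m} o_{ij} = m - n \right\} \right|}{N}, \quad (0 \le n \le m)$$

that is, $p_n$ is the fraction of test examples on which exactly $n$ base learners fail.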



Here, DIS and DF are pairwise measures while ENT and
CFD are non-pairwise measures. In addition, 1-DF is used
instead of DF so that, for all the measures, the greater the
value, the higher the diversity. All four measures vary
between 0 and 1.
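The four measures can be computed directly from the oracle matrix. The following is a minimal sketch rather than the authors' implementation: the function name `diversity_measures` and the toy matrix are hypothetical, the pairwise disagreement term is taken to be the usual "exactly one of the two learners is correct" fraction, and at least two base learners are assumed.

```python
from math import ceil


def diversity_measures(oracle):
    """Return (DIS, 1 - DF, ENT, CFD) for an m x N oracle matrix of 0/1 entries.

    oracle[i][j] == 1 means base learner i classifies test example j correctly.
    Assumes m >= 2 learners and N >= 1 examples (illustrative helper only).
    """
    m, n_examples = len(oracle), len(oracle[0])
    n_pairs = m * (m - 1) / 2

    dis_total = 0.0
    df_total = 0.0
    for i in range(m - 1):
        for k in range(i + 1, m):
            pair = list(zip(oracle[i], oracle[k]))
            # Fraction of examples on which exactly one of the two learners is correct.
            dis_total += sum(a != b for a, b in pair) / n_examples
            # Fraction of examples on which both learners are wrong.
            df_total += sum((1 - a) * (1 - b) for a, b in pair) / n_examples
    dis = dis_total / n_pairs
    df = df_total / n_pairs

    # Number of correct learners for each test example.
    correct = [sum(oracle[i][j] for i in range(m)) for j in range(n_examples)]
    ent = sum(min(c, m - c) for c in correct) / (n_examples * (m - ceil(m / 2)))

    # counts[n] = number of examples on which exactly n learners fail.
    counts = [0] * (m + 1)
    for c in correct:
        counts[m - c] += 1
    p = [count / n_examples for count in counts]
    if counts[0] == n_examples:  # p_0 = 1.0: no learner ever fails
        cfd = 0.0
    else:
        cfd = sum((m - n) / (m - 1) * p[n] for n in range(1, m + 1)) / (1 - p[0])

    # 1 - DF is returned so that larger always means more diverse.
    return dis, 1 - df, ent, cfd


# Toy oracle matrix: 3 learners, 4 test examples (made-up data).
toy = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
print(diversity_measures(toy))
```

On the toy matrix every pair of learners disagrees on two of the four examples and is jointly wrong on one, so DIS is 0.5 and 1-DF is 0.75; ENT is 0.75 and CFD is 0.5.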

Table VI compares Udeed's initial diversity after ensemble
initialization with its final diversity after ensemble learning
under various ensemble sizes. For each data set, pairwise
t-tests at 95% significance level are conducted between the
initial and the final ensemble diversities. Whenever the final



Table VI

The win/tie/loss results for the final ensemble against the initial ensemble in terms of the four diversity measures under various ensemble sizes.

Final ensemble vs. initial ensemble

Data Set    m = 20: DIS  DF  ENT  CFD    m = 50: DIS  DF  ENT  CFD    m = 100: DIS  DF  ENT  CFD


diabetes




























loss 


win 


loss 


tie 


loss 


win 


loss 


loss 


loss 


win 


loss 


loss 




tie 




tie 


tie 


tie 


tie 


tie 


tie 


tie 




tie 


tie 


austra


loss 


win 


loss 


tie


loss 


win 


loss 


tie


loss 


win 


loss 


loss 




























vote 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


vehicle


tie 


tie 


tie 


tie 


loss 


tie 


tie 


tie 


win 


tie 


win 


tie 




win 


tie 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 




tie 


tie 


tie 


tie 








tie 








tie 


ethn 










loss


tie 


tie 


tie 








tie 


ionosphere


























kr_vs_kp


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


isolet 


win 


tie


win 


tie


win 


loss 


win 


tie


win 


loss 


win 


tie


sonar




tie 








tie 




tie 




tie 




tie 














tie 




tie 




tie 




tie 


credit_g


win 


loss 


win 


win 


win 


loss 


win 


win 


win 


loss 


win 


win 


BCI 


























Digit1


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


COIL2


win 


win 


win 


win 


tie 


win 


tie 


win 


tie 


win 


tie 


win 


g241n 


tie 


loss 


tie 


tie 


tie 


tie 


tie 


tie 


tie 


loss 


tie 


tie 


adult 


win 


loss 


win 


win 


win 


loss 


win 


win 


win 


win 


win 


win 


web 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


win 


ijcnn1


loss 


loss 


loss 


loss 


loss 


loss 


loss 


loss 


loss 


loss 


loss 


loss 


cod-rna 


tie 


win 


tie 


win 


tie 


win 


tie 


tie 


win 


win 


tie 


tie 


forest 


tie 


tie 


tie 


tie 


tie 


tie 


tie 


tie 


tie 


tie 


tie 


tie 


win/tie/loss 


15/6/4 


14/6/5 


15/6/4 


15/8/2 


14/5/6 


14/7/4 


14/7/4 


12/11/2 


17/4/4 


17/4/4 


16/5/4 


12/10/3 



ensemble achieves significantly higher/lower diversity than
the initial one, a win/loss is recorded. Otherwise, a tie is
recorded. The resulting win/tie/loss counts are reported
in the last line of Table VI.
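The win/tie/loss rule described here can be sketched as a paired t-test on per-run diversity values. This is an illustration under stated assumptions, not the authors' code: the function name `win_tie_loss`, the number of runs and the sample values are all hypothetical, and the two-sided critical value is passed in by the caller (2.262 is the 5% two-sided value for 9 degrees of freedom, i.e. 10 paired runs).

```python
from math import copysign, inf, sqrt
from statistics import mean, stdev


def win_tie_loss(final, initial, t_critical):
    """Classify the final ensemble against the initial one with a paired t-test.

    final, initial: per-run diversity values, paired by run (same length, >= 2).
    t_critical: two-sided critical value for len(final) - 1 degrees of freedom.
    """
    diffs = [f - i for f, i in zip(final, initial)]
    d_mean = mean(diffs)
    d_std = stdev(diffs)  # sample standard deviation of the paired differences
    if d_std == 0:
        # Identical differences in every run: zero shift is a tie,
        # any nonzero shift is treated as significant.
        t_stat = 0.0 if d_mean == 0 else copysign(inf, d_mean)
    else:
        t_stat = d_mean / (d_std / sqrt(len(diffs)))

    if t_stat > t_critical:
        return "win"   # final diversity significantly higher
    if t_stat < -t_critical:
        return "loss"  # final diversity significantly lower
    return "tie"


# Hypothetical diversity values from 10 runs on one data set.
final_div = [0.52, 0.55, 0.51, 0.56, 0.54, 0.53, 0.57, 0.55, 0.52, 0.56]
initial_div = [0.50, 0.51, 0.50, 0.52, 0.51, 0.50, 0.53, 0.52, 0.50, 0.52]
print(win_tie_loss(final_div, initial_div, t_critical=2.262))
```

Passing the critical value explicitly keeps the sketch free of external dependencies; with a statistics library available, the same decision could be made from a p-value instead.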

In summary, when the ensemble size is small (m = 20), Udeed
statistically increases the initial ensemble diversity in 60%
(DIS), 56% (DF), 60% (ENT) and 60% (CFD) of cases, but
decreases it in only 16% (DIS), 20% (DF), 16% (ENT) and
8% (CFD) of cases.

When the ensemble size is medium (m = 50), Udeed statistically
increases the initial ensemble diversity in 56% (DIS), 56%
(DF), 56% (ENT) and 48% (CFD) of cases, but decreases it in
only 24% (DIS), 16% (DF), 16% (ENT) and 8% (CFD) of cases.

Finally, when the ensemble size is large (m = 100), Udeed
statistically increases the initial ensemble diversity in 68%
(DIS), 68% (DF), 64% (ENT) and 48% (CFD) of cases, but
decreases it in only 16% (DIS), 16% (DF), 16% (ENT) and
12% (CFD) of cases.
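The percentages quoted above follow from the win/tie/loss counts over the 25 data sets. As a small worked check (the helper name `win_loss_rates` is hypothetical; the count strings are copied from the last row of Table VI):

```python
def win_loss_rates(counts):
    """Convert a 'win/tie/loss' count string into (win %, loss %)."""
    win, tie, loss = (int(part) for part in counts.split("/"))
    total = win + tie + loss
    return round(100 * win / total), round(100 * loss / total)


# Last row of Table VI, in column order DIS, DF, ENT, CFD for each ensemble size.
table_vi_totals = {
    20: ["15/6/4", "14/6/5", "15/6/4", "15/8/2"],
    50: ["14/5/6", "14/7/4", "14/7/4", "12/11/2"],
    100: ["17/4/4", "17/4/4", "16/5/4", "12/10/3"],
}

for m, row in table_vi_totals.items():
    print(m, [win_loss_rates(cell) for cell in row])
```

For example, 15 wins and 4 losses out of 25 data sets give the 60% and 16% reported for DIS at m = 20.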

These results clearly verify that Udeed can effectively 
exploit unlabeled data to help augment ensemble diversity. 

V. Conclusion 

Previous ensemble methods try to obtain high accuracy of
the base learners and high diversity among them by
considering only labeled data. Some studies have used
unlabeled data, but they focus on using it to improve
accuracy. The major contribution of our work is to use
unlabeled data to augment diversity, which suggests a new
direction for ensemble design. Specifically, a novel
semi-supervised ensemble method named Udeed is proposed,
which works by maximizing accuracy on labeled data while
maximizing diversity on unlabeled data.

Experiments show that: a) Udeed achieves highly competitive
performance against other successful semi-supervised
ensemble methods; b) Udeed does benefit from unlabeled
data by using them to augment the diversity among base
learners. In the future, it would be interesting to see whether
Udeed works well with other base learners. It would also be
insightful to analyze theoretically why Udeed can achieve good
performance. Furthermore, designing other ensemble methods
that exploit unlabeled data to augment ensemble diversity
gracefully is a direction well worth studying.